We're still looking at sequential decision procedures. If we look back at our agent schema, we're still dealing with the same agents: we have the world-modeling part at the top, and we're down here in the decision-making part.
Now, essentially, we've done decision-making before, but now we're doing it sequentially, that is, in a sequential environment, where the utility of an action depends on a whole sequence of decisions.
We have to take time into account.
Before, we were looking at episodic environments: time passes there as well, but while the agent deliberates about what to do, it doesn't really play a role.
Time doesn't play a role there.
Here, time does play a role, and that's what we're trying to deal with.
We're looking at Markov decision problems; that's something we actually did last week.
Today we're going to graduate to POMDPs, namely partially observable Markov decision processes.
MDPs, Markov decision processes, are essentially the simplified case where we have a fully observable environment.
We want to eventually make decisions in a partially observable world where we have uncertainty
and utility to deal with.
But if, in the technical work we're doing, you feel a little bit unsatisfied because we're always talking about utilities and never about actions and so on, that's actually intended.
We already know what we have to do given a utility.
We just do maximization of expected utility, which is basically a weighted sum: the utility of each possible result state of an action times, since we're uncertain, the probability that the action actually gets us into the state whose utility we're looking at.
Easy peasy.
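Written out in standard notation, that maximum-expected-utility rule would be something like

\[ a^{\ast} \;=\; \operatorname*{argmax}_{a} \sum_{s'} P(s' \mid s, a)\, U(s') \]

where s is the current state, a ranges over the available actions, and s' over the states the action might land us in; the exact symbols are my reconstruction, not quoted from the lecture.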
The only difficult thing with time here is to actually figure out what the utilities are, and in particular to deal with utilities of time sequences.
That's the first step we have to do.
And the second step, equally important, is to somehow get from utilities of time sequences to utilities of individual states.
Because maximizing expected utility is not something that has utilities of time sequences somewhere baked into its genetics.
No, it wants utilities of single states.
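To make that concrete: the usual way to assign a utility to a whole state sequence, which is presumably the setup from last week, is a sum of per-state rewards discounted over time; the reward function R and the discount factor gamma below are the standard textbook symbols, not necessarily the lecture's exact notation.

\[ U([s_0, s_1, s_2, \ldots]) \;=\; R(s_0) + \gamma R(s_1) + \gamma^{2} R(s_2) + \cdots \;=\; \sum_{t \ge 0} \gamma^{t} R(s_t) \]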
So the program, in a way, for MDPs is to say: well, can't we just have utilities of single states?
That's what we want.
And last week we managed to define that.
I'm going to try to convince you that even though we've managed to define it, we don't yet know how to compute it.
And that's what we're learning today.
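For reference, the definition of a single state's utility that I take this to refer to is the Bellman-style fixed point from the standard MDP treatment; this is my reconstruction, not a formula quoted from the recording.

\[ U(s) \;=\; R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s') \]

Finding the U(s) that satisfy this equation for all states simultaneously is exactly the computation we don't have yet.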
Okay.
We're doing this bit here; planning we've already seen.
And we will see, surprisingly, that in POMDPs, plans and so on will make an appearance.
So everything is connected to everything.
Okay.
We looked at this four-by-three grid example, where we had reliable sensing but unreliable actions, and we had this reward function that basically gives you a reward at every time tick.
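As a concrete reminder of that example, here is a minimal Python sketch of the kind of model it describes. The specific numbers, an 80% chance that a move succeeds with a 10% slip to each side, a small negative reward per tick, and terminal rewards of +1 and -1, are the usual textbook version of the 4x3 world and are my assumption, not read off the recording.

# Sketch of the 4x3 grid world (assumed standard textbook parameters).
# States are (column, row) pairs; (2, 2) is a wall; (4, 3) and (4, 2) are terminal states.

STEP_REWARD = -0.04                       # assumed small per-tick penalty
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}  # assumed terminal rewards
WALL = (2, 2)
COLS, ROWS = 4, 3

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def move(state, direction):
    # Deterministic move; bumping into the wall or the grid edge leaves us where we are.
    c, r = state
    dc, dr = MOVES[direction]
    nxt = (c + dc, r + dr)
    if nxt == WALL or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return state
    return nxt

def transition(state, action):
    # Unreliable actions: intended direction with probability 0.8,
    # each perpendicular direction with probability 0.1 (assumed values).
    if state in TERMINALS:
        return {state: 1.0}
    side_a, side_b = PERPENDICULAR[action]
    dist = {}
    for direction, p in [(action, 0.8), (side_a, 0.1), (side_b, 0.1)]:
        s2 = move(state, direction)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

def reward(state):
    # Reward you collect at every time tick; terminal states carry their own reward.
    return TERMINALS.get(state, STEP_REWARD)

def expected_utility(state, action, U):
    # The weighted sum from the MEU rule: sum over s' of P(s' | s, a) * U(s').
    return sum(p * U[s2] for s2, p in transition(state, action).items())

Given a table of state utilities U, the best action in a state would then just be the action maximizing expected_utility(state, action, U), which is exactly the maximization of expected utility from above.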